Goto

Collaborating Authors

 prompt size


PromptBlack-box APIRaw runtime(= denoised runtime+ noise)Prompt has num_prompt_tokens, output hasnum_output_tokensChosen hardware and software(e.g., A100 GPUs and Megatron)Idealized runtimePrompt

Neural Information Processing Systems

Large language models (LLMs) are highly capable but also computationally expensive. Characterizing the fundamental tradeoff between inference efficiency and model capabilities is thus important, but requires an efficiency metric that is comparable across models from different providers. Unfortunately, raw runtimes measured through black-box APIs do not satisfy this property: model providers can implement software and hardware optimizations orthogonal to the model, and shared infrastructure introduces performance contention. We propose a new metric for inference efficiency called idealized runtime, that puts models on equal footing as though they were served on uniform hardware and software without performance contention, and a cost model to efficiently estimate this metric for autoregressive Transformer models. We also propose variants of the idealized runtime that incorporate the number and type of accelerators needed to serve the model. Using these metrics, we compare ten LLMs developed in 2022 to provide the first analysis of inference efficiency-capability tradeoffs; we make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model.







Assessing Foundational Medical 'Segment Anything' (Med-SAM1, Med-SAM2) Deep Learning Models for Left Atrial Segmentation in 3D LGE MRI

arXiv.org Artificial Intelligence

Atrial fibrillation (AF), the most common cardiac arrhythmia, is associated with heart failure and stroke. Accurate segmentation of the left atrium (LA) in 3D late gadolinium-enhanced (LGE) MRI is helpful for evaluating AF, as fibrotic remodeling in the LA myocardium contributes to arrhythmia and serves as a key determinant of therapeutic strategies. However, manual LA segmentation is labor-intensive and challenging. Recent foundational deep learning models, such as the Segment Anything Model (SAM), pre-trained on diverse datasets, have demonstrated promise in generic segmentation tasks. MedSAM, a fine-tuned version of SAM for medical applications, enables efficient, zero-shot segmentation without domainspecific training. Despite the potential of MedSAM model, it has not yet been evaluated for the complex task of LA segmentation in 3D LGE-MRI. This study aims to (1) evaluate the performance of MedSAM in automating LA segmentation, (2) compare the performance of the MedSAM2 model, which uses a single prompt with automated tracking, with the MedSAM1 model, which requires separate prompt for each slice, and (3) analyze the performance of MedSAM1 in terms of Dice score(i.e., segmentation accuracy) by varying the size and location of the box prompt. Keywords: Foundational model, left atrial segmentation, Atrial fibrillation, Cardiac MRI (CMR), MedSAM, SAM, 3D LGE-MRI.


Generative Network Layer for Communication Systems with Artificial Intelligence

arXiv.org Artificial Intelligence

The traditional role of the network layer is the transfer of packet replicas from source to destination through intermediate network nodes. We present a generative network layer that uses Generative AI (GenAI) at intermediate or edge network nodes and analyze its impact on the required data rates in the network. We conduct a case study where the GenAI-aided nodes generate images from prompts that consist of substantially compressed latent representations. The results from network flow analyses under image quality constraints show that the generative network layer can achieve an improvement of more than 100% in terms of the required data rate.


K-ESConv: Knowledge Injection for Emotional Support Dialogue Systems via Prompt Learning

arXiv.org Artificial Intelligence

Automatic psychological counseling requires mass of professional knowledge that can be found in online counseling forums. Motivated by this, we propose K-ESConv, a novel prompt learning based knowledge injection method for emotional support dialogue system, transferring forum knowledge to response generation. We evaluate our model on an emotional support dataset ESConv, where the model retrieves and incorporates knowledge from external professional emotional Q\&A forum. Experiment results show that the proposed method outperforms existing baselines on both automatic evaluation and human evaluation, which shows that our approach significantly improves the correlation and diversity of responses and provides more comfort and better suggestion for the seeker.


Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs

arXiv.org Artificial Intelligence

Large language models (LLMs) power many state-of-the-art systems in natural language processing. However, these models are extremely computationally expensive, even at inference time, raising the natural question: when is the extra cost of deploying a larger model worth the anticipated boost in capabilities? Better understanding this tradeoff fundamentally could benefit from an inference efficiency metric that is both (i) easily comparable across models from different providers, and (ii) representative of the true cost of running queries in an isolated performance environment. Unfortunately, access to LLMs today is largely restricted to black-box text generation APIs and raw runtimes measured through this interface do not satisfy these desiderata: model providers can apply various software and hardware optimizations orthogonal to the model, and models served on shared infrastructure are susceptible to performance contention. To circumvent these problems, we propose a new metric for comparing inference efficiency across models. This metric puts models on equal footing as though they were served (i) on uniform hardware and software, and (ii) without performance contention. We call this metric the \emph{idealized runtime}, and we propose a methodology to efficiently estimate this metric for autoregressive Transformer models. We also propose cost-aware variants that incorporate the number of accelerators needed to serve the model. Using these metrics, we compare ten state-of-the-art LLMs to provide the first analysis of inference efficiency-capability tradeoffs; we make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model. Our methodology also facilitates the efficient comparison of different software and hardware stacks.